Add spark-gluten-clickhouse entry (Spark + Gluten with the CH backend) #861

Open
alexey-milovidov wants to merge 1 commit into main from add-spark-gluten-clickhouse

@alexey-milovidov (Member)

Summary

  • Adds a spark-gluten-clickhouse/ entry that runs Apache Spark with Apache Gluten configured to use the ClickHouse backend (spark.gluten.sql.columnar.backend.lib=ch). Gluten loads libch.so (a fork of ClickHouse v23.1) into the Spark executor JVM and runs the columnar physical plan natively through it.
  • Complements spark-gluten/ (Velox backend) and the proposed spark-velox/ entry (#858, "Add spark-velox entry (Spark + Velox via Apache Gluten)"). This entry exercises a meaningfully different execution path: Catalyst → Substrait → ClickHouse engine, rather than Catalyst → Substrait → Velox.
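
For orientation, here is a minimal sketch of the Spark settings such an entry would plausibly pass. Only spark.gluten.sql.columnar.backend.lib=ch is taken from this PR. The plugin class, the libpath key, the off-heap keys, the helper name gluten_ch_conf, and the libch.so path are assumptions for illustration, and the even heap/off-heap split follows the Notes section. benchmark.sh may set these differently.

```python
def gluten_ch_conf(total_memory_gb: int, libch_path: str) -> dict[str, str]:
    """Build illustrative Spark settings for Gluten's ClickHouse backend.

    Hypothetical helper: only the backend.lib key/value comes from the PR text.
    """
    offheap_gb = total_memory_gb // 2        # half for Gluten's off-heap pool
    heap_gb = total_memory_gb - offheap_gb   # the remainder for the JVM heap
    return {
        "spark.plugins": "org.apache.gluten.GlutenPlugin",   # assumed plugin class
        "spark.gluten.sql.columnar.backend.lib": "ch",        # from the PR summary
        "spark.gluten.sql.columnar.libpath": libch_path,      # assumed key, placeholder path
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap_gb}g",
        "spark.driver.memory": f"{heap_gb}g",
    }


conf = gluten_ch_conf(48, "/path/to/libch.so")
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

Returning a plain dictionary keeps the sketch independent of any Spark installation. The same pairs could be passed to spark-submit or to a SparkSession builder.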

Build

No pre-built bundle is published for the CH backend (Apache Gluten v1.4.0 ships only the Velox bundle, and no CH-backend artifact is available on Maven Central). benchmark.sh therefore builds two things from source:

  1. libch.so — built from Kyligence/ClickHouse at the branch pinned in gluten/cpp-ch/clickhouse.version (currently rebase_ch/20250326). Uses Clang 18 / cmake / ninja.
  2. The Gluten Spark plugin — built via Maven with -Pbackends-clickhouse,spark-3.5,scala-2.12 under JDK 8.
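
The two source builds above could start from commands like the ones sketched here. The branch name and the Maven profiles are taken from this PR. The repository URL, the shallow-clone flags, -DskipTests, and the helper name build_commands are assumptions, and the cmake/ninja step that actually compiles libch.so is left out because its exact invocation is not given here. The commands are only assembled and printed, not run.

```python
import shlex


def build_commands(ch_branch: str = "rebase_ch/20250326") -> list[list[str]]:
    """Return argv lists for fetching ClickHouse and building the Gluten plugin.

    Hypothetical helper; benchmark.sh may perform these steps differently.
    """
    clone_clickhouse = [
        "git", "clone", "--depth", "1", "--branch", ch_branch,
        "--recurse-submodules", "--shallow-submodules",
        "https://github.com/Kyligence/ClickHouse.git",   # assumed URL for Kyligence/ClickHouse
    ]
    # Profiles as listed in the PR; the build is expected to run under JDK 8.
    build_plugin = [
        "mvn", "clean", "package",
        "-Pbackends-clickhouse,spark-3.5,scala-2.12",
        "-DskipTests",                                    # assumption: skip tests for speed
    ]
    return [clone_clickhouse, build_plugin]


for argv in build_commands():
    print(shlex.join(argv))
```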

Limitations

  • The libch.so compile is essentially a ClickHouse build and is RAM-hungry; Gluten's docs recommend ≥64 GB. On c6a.4xlarge (32 GB) it may OOM — c6a.8xlarge or larger is recommended for a clean run, hence the default machine label in benchmark.sh.
  • ARM is untested. Both ClickHouse and the Gluten plugin should compile on aarch64 in principle, but the Gluten CI does not publish CH-backend artifacts for ARM.
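
Because of the memory requirement, a preflight check before the long libch.so compile may save a failed run. The sketch below is not part of benchmark.sh. The function names and the 5% tolerance are assumptions; the tolerance is there because a nominal 64 GB instance usually reports somewhat less usable RAM to the operating system.

```python
import os

RECOMMENDED_GIB = 64  # figure quoted from Gluten's docs in this PR


def total_ram_gib() -> float:
    """Physical RAM in GiB, read via POSIX sysconf (Linux and macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


def enough_ram_for_build(ram_gib: float,
                         recommended_gib: float = RECOMMENDED_GIB,
                         tolerance: float = 0.05) -> bool:
    """True if ram_gib is within `tolerance` of the recommended size or above it."""
    return ram_gib >= recommended_gib * (1 - tolerance)


if __name__ == "__main__":
    ram = total_ram_gib()
    verdict = "ok" if enough_ram_for_build(ram) else "the libch.so build may run out of memory"
    print(f"{ram:.1f} GiB RAM: {verdict}")
```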

Notes

  • Queries use ClickHouse-style regex backreferences (\1) rather than Spark's $1, because regex evaluation runs inside libch.so. This was anticipated in the existing spark-gluten/README.md and Gluten issue #7545.
  • Memory split between Spark heap and the Gluten off-heap pool is 50/50, identical to the Velox entry — the CH backend also runs off-heap via JNI.
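
To illustrate the backreference difference, the sketch below rewrites a Spark-style replacement string ($1) into the backslash form (\1) that the queries in this entry use. The helper name to_clickhouse_backrefs is hypothetical, and it handles only single-digit group references. Python's re module also uses the backslash form, so it serves here to show the rewritten replacement in action.

```python
import re

# An unescaped "$" followed by one digit, e.g. "$1"; "\$1" is left untouched.
_DOLLAR_REF = re.compile(r"(?<!\\)\$(\d)")


def to_clickhouse_backrefs(replacement: str) -> str:
    """Rewrite $N group references as \\N (single-digit groups only)."""
    return _DOLLAR_REF.sub(r"\\\1", replacement)


pattern = r"^https?://(?:www\.)?([^/]+)/.*$"
replacement = to_clickhouse_backrefs("$1")
print(replacement)                                                     # prints \1
print(re.sub(pattern, replacement, "https://www.example.com/a/b"))    # prints example.com
```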

Test plan

  • Run on an x86_64 c6a.8xlarge.
  • Verify benchmark.sh clones gluten + Kyligence/ClickHouse, builds libch.so and the Spark plugin, runs all 43 queries, and writes results/<machine>.json.
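
The last check in the test plan could be automated along these lines. This is a sketch that assumes the result layout used by other ClickBench entries: a "result" array with one row per query, each row holding three run timings, with null marking a failed run. The function name check_results and the sample data are made up for illustration.

```python
EXPECTED_QUERIES = 43   # query count stated in the test plan
RUNS_PER_QUERY = 3      # assumption: three timed runs per query


def check_results(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks complete."""
    problems = []
    rows = doc.get("result", [])
    if len(rows) != EXPECTED_QUERIES:
        problems.append(f"expected {EXPECTED_QUERIES} queries, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        if len(row) != RUNS_PER_QUERY:
            problems.append(f"query {number}: expected {RUNS_PER_QUERY} timings, found {len(row)}")
        elif any(timing is None for timing in row):
            problems.append(f"query {number}: at least one run failed (null timing)")
    return problems


sample = {"machine": "c6a.8xlarge", "result": [[0.5, 0.4, 0.4]] * 43}
print(check_results(sample))   # prints []
```

For a real run, the dictionary would come from loading results/<machine>.json with json.load.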

🤖 Generated with Claude Code

Adds a spark-gluten-clickhouse/ entry that runs the ClickBench query
suite against Apache Spark with Apache Gluten configured to use the
ClickHouse backend ('ch'), in which Gluten loads libch.so (a fork of
ClickHouse v23.1) into the Spark executor JVM and runs the columnar
plan natively through it.

Compared with spark-gluten/ (which uses the Velox backend), this
exercises a meaningfully different execution path: Catalyst -> Substrait
-> ClickHouse engine, rather than Catalyst -> Substrait -> Velox.

No pre-built bundle is published for the CH backend (the Apache Gluten
release tarball ships only the Velox bundle), so benchmark.sh builds
both libch.so and the Gluten Spark plugin from source. The build is
memory-hungry; a 64 GB host (c6a.8xlarge or larger) is recommended.

Queries use ClickHouse-style regex backreferences (\1) since the regex
evaluation runs inside libch.so, as anticipated in the spark-gluten/
README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>